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Abstract 

Estimating parameters of continuous-time linear birth-death-immigration processes, observed dis- 
cretely at unevenly spaced time points, is a recurring theme in statistical analyses of population dynam- 
ics. Viewing this task as a missing data problem, we develop two novel expectation-maximization (EM) 
algorithms. When birth rate is zero or immigration rate is either zero or proportional to the birth rate, we 
use Kendall's generating function method to reduce the E-step of the EM algorithm, as well as calcula- 
tion of the Fisher information, to one dimensional integration. This reduction results in a simple and fast 
implementation of the EM algorithm. To tackle the unconstrained birth and immigration rates, we ex- 
tend a direct sampler for finite-state Markov chains and use this sampling procedure to develop a Monte 
Carlo EM algorithm. We test our algorithms on simulated data and then use our new methods to explore 
the birth and death rates of a transposable element in the genome of Mycobacterium tuberculosis, the 
causative agent of tuberculosis. 



1 Introduction 



Linear birth-death- imm i gratio n (BDI) processes pro yide useful building blocks for modelin g population dy- 

namics in ecology (INeel.l2006T) . molecular e yolution (IThorne et all 1 199 lb . and epidemiology (IGibson and Renshawl. 
1998b among many other areas. Although Keiding (1975) has extensively studied inference for fully ob- 
served continuous-time BDI processes, more often such processes are not observed completely, posing 
interesting computational problems for statisticians. Here, we use applied probability tools to develop new, 
efficient implementations of the expectation-maximization (EM) algorithm for fitting discretely observed 
BDI processes. 

We assume that we observe one or multiple independent BDI trajectories at fixed, possibly irregularly 
spaced, time points. iHolmesI (120051) proposed an EM algorithm for such discretely observed BDI processes 
in the context of finding the most optimal alignment of multiple genomic sequences. Of course, the EM 
algorithm is not the only way to find maximum li kelihood estimates (MLEs) of discretely observed BDI 
parameters, as demonstrated by Thorne et al.l (119911) . who initially proposed the BDI model for the sequence 
alignment problem. However, iHolmes d2005h ~ argues that the EM algorithm's simplicity and robustness 
make this method attractive for large-scale bioinformatics applications. 

Computing expectations of the complete-data log-likelihood, needed for executing an EM algorithm, can 
be challenging, especially if the complete-data were generated by a continuous-time stochastic process. 
When complete data are generated by a finite state-space continuous-time Mark ov chain (CTMC), these ex- 
pectations can be computed efficiently (ILangel.ll995uHolmes and Rubinl . 120021) . Although the BDI process 
is also a C TMC, th e infini te state-space of the process prohibits us from using these computationally efficient 
methods. IHolmesI (12005b considers a BDI model with the immigration rate either zero or proportional to 
the birth rate. Under this restriction, the complete-data likelihood belongs to the exponential family, which 
means that the expect ed comp l ete-da ta log-likelihood is a linear combination of expected sufficient statistics 
of the complete data. IHolmesI (|2005l) computes these expectations of the sufficient statistics by numerically 
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solving a system of coupled non-linear ordinary differential equations (ODEs). For this restricted immi- 
gration model and for the death-immigration model we develop a new computat ionally e f ficient met hod for 
compu ting the expected sufficient statistics. Our method combines ideas from iKendalll (11948b and lLange 
(|1982T) and reduces computations of expected sufficient statistics to one-dimensional integration, a computa- 
tional task that is much simpler than solving a system of nonlin ear ODEs. W e develop a similar integration 
method to compute the observed Fisher informat ion matrix via lLouisI (|1982T )'s formula. Moreover, for the 
death-immigration model and for iHolmesI (120051) 's sequence alignment model, we derive the expectations 
of the complete-data sufficient statistics in closed form. 

When rates of the BDI model are not constrained, the infinite-dimensional vector of sufficient statistics 
precludes us from applying the above integration technique. Therefore, we resort to Monte Carlo (MC) 
estimation of the expected complete-data log-likelihood. Our MC-EM algorithm requires sampling BDI 
trajectories o yer a finite time -interval conditional on the observed states of the process at the end-points of 
the interval. iGolinellil (|2000T ) previously accomplished this task via re versible jump M arkov chain Monte 
Carlo (MCMC). Instead of this MCMC procedure, we adopt and extend iHobolfhl d2008h 's direct sampler for 
finite state-space CTMCs to simulate end-point conditioned BDI trajectories exactly. While the resulting 
MC-EM algorithm is slower than our integration method for the restricted immigration BDI model, the 
algorithm still performs well on moderately-sized problems. 

We test our two EM algorithms on simulated data. We then turn to a problem of estimating birth and death 
rates of the transposable element IS6110 in Mycobacterium tuberculosis, the causative bacterial agent of 
most tuberculosis in humans. Estimating these rates is an important task in molecular epidemiol ogy, because 



resea r chers use 1S6110 genoty pes to group infected individuals into epidemiological clusters (ISmall et al. 



19941) . iRosenberg et al.l(|2003h use serially sampled IS61 10 genotypes fromM. tuberculosis infected patients 



to estimate IS6110 birth and death rates. These authors proposed an approximate likelihood method to 
accomplish this estima tion. We revisit this problem using our EM algorithm and compare our results with 



Rosenberg et al] (120031) 's approximation. We also examine differences in birth and death rates among three 



main lineages of M. tuberculosis. 



2 The EM Algorithm 



We start with a continuous-time homogeneous linear BDI process {X t } with birth rate A > 0, death rate 
/i > 0, and immigration rate v > 0. We assume that we observe the process, Xt = 0, 1, . . . , at n + 1 distinct 
times, = to < ti < . . . < t n . We denote our data vector by Y = (X to , . . . , X tn ) and the parameter vector 
by = (A, /i, u). In addition to the full BDI model, we consider a restricted immigration model, in which 
we require v = /3A for some known constant /3, and a death-immigration model, where we set A = 0. 
We are interested in computing the MLEs of the parameters, = arg maxg Z D (Y; 6), where 



n-l 



l (Y;0) := £>gpx t .,x t . +1 (ti+i-ti-,0) 



(1) 



is the observed-data log-likelihood and pij (t; 6) = Pg (X t = J\Xq = i), i,j = 0, 1, . . ., are the transition 
probabilities of the BDI process. T hese transition probabilities can be calculated eith er using the generating 
functio n derived by lKendall d 19481) or via the orthogonal polynomial representation of lKarlin and McGregor! 

Despite the explicit algebraic nature of the orthogonal polynomi als, the lat t er me thod can be nu- 
merically unstable and the generating function method is often preferred (|Sehl et all 120090 . Although one 
can maximize the likelihood l a (Y; 6) using standard off-the-shelf optimization algorithms, such generic 
algorithms can be problematic when one needs to analyze multiple data sets without manual tuning of the 
algorithms and when the BDI rates are functions of a high dimensional parameter vector. As an alternative 
to generic optimization, we develop EM algorithms, known for their robustness and ability to cope with high 
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dimen sional optimization, to maximize the observed data likelihood under BDI models. (IDempster et al. 



1973). 



Complete data in our case consist of the BDI trajectories {X t }, observed continuously during the interval 
[to, t n ]. Let l c be the log-likelihood of the complete data. To execute an EM algorithm we need to be able 
to compute Eg> [l c ({X t };6)\Y] (the E-step) and to maximize this expectation over (the M-step). We 
develop separate algorithms for implementing these E- and M-steps for the restricted immigration/death- 
immigration and the full BDI models, because the two classes of models differ in the way the complete data 
collapse into sufficient statistics. 

2.1 Restricted Immigration and Death- Immigration Models 

Since the BDI process is a CTMC, the log-likelihood of complete data is 

oo oo 

l c ({X t }; 9) = - ^2 d(i) [i(X + /x) + u] + ^2 [n i>i+1 log(iA + v) + n M _i log(i/z)] + const, (2) 

i=0 i=Q 

where d(i) is th e total time spe nt in state i and n^j is the number of jumps from state i to state j during the 



interval [to, t n ] (IGuttorpl. 1 1 99 51) . Replacing v with /3A in the above equation, we arrive at the complete-data 



log-likelihood for the restricted immigration model, 

l c ({X t }; A, fj) = -R tn (A + fj) - t n /3X + N+ log A + Nf n log /i + const, (3) 

where the number of jumps up iV t := J2 i>0 fH,i+i, the number of jumps down iV t ~ := ^i>o n i,i-i' m< ^ 
the total particle-time R tn := J t * n X s ds = X)i=o^(*) ^ sufficient statistics. Similarly, if we set A = 
(death-immigration model), then 

l c ({X t }; n, v) = -RtnH ~ t n v + iV+ log v + Nf n log fi + const, (4) 

Equations (f3]) and © show that, for the E-step, the only expectations we need are [/g y := Eg [A^|Y], 
Dq y '■= Eq [Nf | Yj , and Pe,Y '■= Eq [R tn \ Y]. Using the Markov property and additivity of expectations, 
we break these expectations into sums of expectations of the numbers of jumps up and down and the total 
particle time during each time interval [tk, tk+i], conditional on X tk and X tk+1 . Homogeneity of the BDI 
models suggests that in order to complete the E-step of the EM algorithm, we need to be able to calculate 

U i , j (t)=E(N+\Xo = i,X t =j), 

D id {t) = E (N t ~ | X = i,X t = j) , and (5) 
Pij(t)=E(R t | X =i,X t =j). 

Following iMinin and Suchard d2008h . we choose to work with restricted moments 



Uij(t)=E(N+l {Xt=j} \X = i), 

Aj(i) = E {Nfl {Xt=j} \X = i), and (6) 
P id {t)=E(R t l {Xt=j} \X = i), 

that we can divide by transition probabilities Pij(t) to recover the conditional expectations ©. In order to 
compute the restricted moments, we first consider the joint generating function 

Hi(u,v,w,s,t) :=E(u N+ v N ~e- wR s Xt \X = i), (7) 
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where < u, v, s < 1 and w > 0. Partial derivatives of this function, 

dHi(u,l,0,s,t) 



GT(t,s) 
G*(t,s) 



du 

dHi(l,v,0,s,t) 



= W VnPr(7V+ = n,X t 
u=l — ■ 

j=0 n=0 



<9t> 

d-ffi (l,l,io,s,f 



u=1 = E si E nPr (^r = ^ 



j=0 71=0 



En.,!/).s'. 

j=0 

oo 

'^2f)ij(t)s\ and 

j=0 



(8) 



<9w 



w=0 



E- 

3=0 



xdPr(R t <x,X t = j) 



3=0 



are power series with coefficients Uij(t), Dij(t), and —Pij(t) respectively. Therefore, if we can compute 
Gf(t, s), Gj{t, s), and G*(t, s) for every possible t and s, then we should be able to recover coefficients of 
the corresponding power series via differentiation or integration. Numerical evaluation of partial derivatives 
([8]) is straightforward if we can compute finite differences of Hi(u, v, w, s, t). Remarkably, Hi(u, v, w, s, t) 
is available in closed form, as we demonstrate in the theorem below, so one can even obtain derivatives © 
analytically. 

Theorem 1. Let {Xt} be a linear BDI process with parameters A > 0, p > 0, and v > 0. Over the interval 
[0, t], let Nf~ be the number of jumps up, be the number of jumps down, and Rt be the total particle- 
time. Then Hi(u,v,w, s,t) = E (^u N t v Nt e~ wRt s Xt \ Xq = i^J satisfies the following partial differential 
equation: 

d d 

■T^Hi = [s 2 u\ - (A + n + w)s + vfj] -^Hi + v(us - 1)H U (9) 

subject to initial condition Hi(u,v,w,s,0) = s l . The Cauchy problem defined by equation (© and the 
initial condition has a unique solution. When A > 0, the solution is 



Hi(u,v,w,s,t) 



«i-« 2 fEt e ~ A(a2 ~ ai)rt V 



s-a2 1 



«1 — C*2 



s -a 2 -{s- Qi) e - A («2-«i)rt 



ua\)t 



where cti 



A+/i+ui=p^/ (A+/.t+u>) 2 — iXfiuv 



(10) 



2Au 



for i = 1,2. When A = 0, the solution is 



Hi(u, v, w, s,t) = se 



p + w I 



Proof. Our proof, detailed in Appendix A, is a generalization of lKendalll (|1948h 's derivation of the generat- 
ing function of X t . □ 

Having Hi in closed form gives us access to functions Gf, G~ , and G*, so we are left with the task 
of recovering coefficients of these power series. One way to accomplish this task is to differentiate the 

1 d J G+{s,t) 



power series repeatedly, e.g. Uij(t) 



In Appendix C, we demonstrate that for the death- 



I 1 J k s 

immigration model and iHolmesI (|2005r) 's restricted BDI model, these derivatives can be found analytically. 

In general, repeated differentiation of Gf, Gj, and G* needs to be done numerically, making this method 

impractical. Instead, we extend Gf (t, •), G^ (t, •), and G* (t, •) to the boundary of a unit circle in the 

complex plane by the change of variables s = e 2mz (i in this context is the imaginary number not the 

initial state of the BDI process). For example, 



Gf(t,e 2 ™) = '£U lJ (t)e 2 * 

3=0 



I J 2 



is a periodic function in z, which means that Uij(t) are Fourier coefficients of this periodic function. There- 
fore, we can use the Riemann approximation to the Fourier transform integral to obtain 



Uij(t) 



K-l 



G+ (t, e 27ris ) e 



-2iribs 



ds 



^J2 G t (t,e 2mk / K ) e ~ 2mbk / K , 

k=0 



for some suitably large K. The Fast Fourier Transform (FFT) Jkenricil . 1979) can be applie d to compute 
quickly multiple Fourier coefficients ( Lange . 1982 : Dorman et all 12004 : ISuchard et al. , 2008 ). We do not, 
however, use FFT in our algorithm, because for a particular time interval length t, we almost always need to 
compute Uij(t), Dij(t), Pi,j{t) for only one value of j. 

To complete the M-step at the kth iteration of the EM algorithm, we differentiate equation Q with respect 
to A and fi to obtain 



De k ,Y 



and Afc 



Pe h ,Y ' " K+1 Pe k ,Y + Pt n 
for the restricted immigration model. Similarly, we use equation (@} to find 



and u k+1 - Ue - Y 



Pe 



t, 



for the death-immigration model. 
We obtain the observed Fisher information via lLouisldl982b 's formula: 



i Y (§) 



-l c ({X t };0)\Y 



i c ({x t y,e)i c ({x t y,ey\Y 



where l c is the gradient and l c is the Hessian of the complete-data log-likelihood. This requires the cal- 
culation of the conditional cross-product means, E [N^~N^ | Y], E [A^ + i?r|Y], E [A^ _ it/r|Y] , and the 
conditional second moments of N^,N^, and Rt- The derivation of the information in terms of these mo- 
ments is in Appendix B. These conditional second- and cross-moments, as well as, Py and Dy, can be 
computed in analogous fashion to U Y above, using the joint generating function (fTTTb . 



2.2 Full BDI Model 

The complete-data log-likelihood of the full BDI model, up to an additive constant, is 

oo 

l c ({X t }; 6) = - [R tn (A + fi) + t n u] + n i,i+i lQ g ( iX + + lo M N t n ■ 

i=0 



(11) 



Thus the sufficient statistics are {nj j + i}j>o, A^ - , and Rt n . The technique we developed for the restricted 
immigration case does not apply to E (n^+il Y) for all i > 0, because we can not derive the joint generating 
function for state-specific jumps up, rii i+i for i > 0, in closed form. Moreover, even if could calculate these 
expectations, the problem of evaluating the infinite sum in (fTTT) would remain. Therefore, in the E-step of 
the EM algorithm, we resort to Monte Carlo to compute the expec ted complet e -data log-likelihood for the 
full BDI model. We apply the ascent-based MC-EM method of ICaffo et al.1 (120050 . which dynamically 
adjusts the number of Monte Carlo simulations to guarantee the ascent property of the EM algor i thm. In 
order to approximate expectations of the sufficient statistics via Monte Carlo, we adapt iHobolthl (|2008l Vs 
direct sampling method for finite state space CTMCs conditioned on end points to simulating full linear 
B DI proce s s traje ctories conditioned on end points. 

Hobolthl d2008b 's direct sampling is a recursive algorithm that generates realizations of {X t }f =0 , condi- 
tional on Xq = a and Xt = b. Let pi = f Q fi (t) dt be the probability that there is a jump to state i before 
time T, where 



/*(*) 



-Xat 



K,iPi,b(T - t)/p atb (T), 



(12) 
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i is a + 1 or a — 1, A aja+ i = \a + u, and A aja _i = /ia. We start the algorithm by using the probabilities 
(1 — Pa+i — Pa-ij Pa+i, Pa-i) to randomly decide whether not to jump, to jump up, or to jump down. If 
there are no jumps, then necessarily we started with a = b and the algorithm ends; if there is a jump up or 
down we use the inverse cumulative distribution function (CDF) method to simulate the time at which the 
jump occurs using CDFs = Jq fi(x)/pidx, for i = a + 1, a — 1. We compute and invert these C DFs 
numerically using quick to compute transition probability formulae from lKarlin and McGregor dl958h . Af- 



ter simulating the first jump time, t, over the interval [0, T], we repeat this process by setting Xq = i, and 
T to T — t until T — r < when the algorithm terminates. Because we can simulate very rapidly from 
the BDI process without conditioning on the ending state, accept-reject sampling also provides an effective 
method of simulating the conditional BDI process when the outcomes we condit ion on ar e mode rately likely. 



When they are very rare, accept-reject sampling can be exceedingly slow, and iHobolthl (120081 ) 's method is 
necessary. 

In the M-step of the EM algorithm, the complete-data log-likelihood cTTTTt can be written as l c ({X t }; 0) = 
h({Xt}\ /i) + h{{Xt}\ A, u), , making it possible to maximize separately over p, and over (A, u). Maximiz- 
ing over fi, we get 

Ue k ,Y 

Maximizing function (fTTT > over (A, u) is more difficult, as there is no obvious analytic solution. First, we 
convert this 2-dimensional optimization problem into a 1 -dimensional one by noticing that setting the deriva- 
tives with respect to A and v to and then summing the two equations together yields 

-P 0k ^X-Tu + U 0k>Y = 0. (13) 

We can then write, for instance, z^(A) = U ° k '^ rj ^ 6k ' vX , plug this expression into the likelihood function (fTTT t 
and apply the Newton-Raphson algorithm to maximize this function with respect to A. Using the optimal 
value A, we recover i> = v{X). Because our Monte Carlo E-step takes most of the computing time, we 
allow Newton-Raphson to take as many iterations as it needs to converge under a prespecified tolerance; 
Newton-Raphson algorithm generally only takes between 2 and 5 iterations to converge in our experience. 



3 Results 
3.1 Simulations 

To test our methods, we simulate data from a restricted immigration BDI model with A = .07, [i = .12 
and p = 1.2. The parameters are chosen to resemble, but not exactly match, the dynamics of our biological 
example, discussed in the next subsection. We simulate 100 independent processes starting from initial 
states drawn uniformly between 1 and 15. From each process we collect at least two observations. We place 
observation times uniformly between and 30. Table Q] gives some summary statistics for the simulated 
data. 

First, we assume the restricted immigration model and apply our EM algorithm for this model with initial 
parameter values of 0.2 for both A and fj,. We tested other choices of starting values, but the algorithm 
was not sensitive to them. Next, we pretend that we do not know that the data were generated under the 
restricted immigration model and fit the full BDI model to the simulated data using our MC-EM algorithm, 
starting each parameter, A, \i, and u, at 0.2. The results of fitting these two models are shown in Table [2] 
As expected, estimation is more precise for the restricted immigration model. In the full BDI model, v is 
the most difficult to identify unless there are many observations starting very close to 0, so the variance of v 
tends to be large. 
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Value Simulated Data IS6iiOData 



Nnmhpr of Tntprvals 

J. "1 UlllUL'l V 1 A11LV.-.L V til .1 


387 


252 


Avprncrp Tntprvnl T pncrth 


•J 


35 


Niimhpr of Individuals 


100 


196 


Numbpr of Intervals with an Increase 

± > LI 1 1 J L_/w± V/l 111LL/1 V til lj Vv 1111 till lllvl vuljv 


78 


14 


Average Increase cnven an Increase 

flv LUl^V lllUlL/UijL' cl V Ull till 1 1 IV— I \_<tlk>\_< 


1.5 


1 


Number of Intervals with a Decrease 

1. ^ Ulll Uwl VI lllLwl V til l i VV 1 11 1 tl 1 ' \* K, 1 wliJv 


190 


14 


Average Decrease given a Decrease 


2.5 


1.2 


Number of Intervals with No Change 


119 


224 


Mean Starting State 


5.5 


11 


Standard Deviation of Starting State 


3.8 


5.3 


Total Length of Time 


1947 


89 



Table 1: Summary statistics for the simulated and M. tuberculosis IS6110 data. 



A [i v 

Restricted Immigration 0.067(0.052,0.081) 0.12(0.10,0.14) j 

Full Model 0.057(0.039,0.074) 0.12(0.099,0.14) 0.11(0.058,0.16) 

True Value 0.07 0.12 0.084 



Table 2: Results of EM algorithms applied to simulated data under the restricted immigration and full BDI models. 
Reported values are maximum likelihood estimates and 95% confidence intervals. 



3.2 Comparison with the Frequent Monitoring Method 

We compare our EM algo rithm for computing the actual MLE to the frequent monitoring (FM) method 



of iRo senberg et a l.l (120031) for computing the MLE of an approximate likelihood. In the FM method, 
Rosenberg et al.1 (|2003l ) assume that if the starting and ending values of the birth-death process are equal 
for a particular interval, then no jumps occurred in this interval. Further, if the difference between the start- 
ing and ending values is —1 or 1, then exactly one jump up or exactly one jump down must have occurred 
respectively. The authors exclude all observed intervals, for which starting and ending values differ by more 
than one unit. Let i be the starting state for an interval, t the length of the interval, and Aj = i (A + fi). Then 
the correspon ding probabilitie s for th e three possible events are e~ XiU , p (1 — e~ XlU ), and ^ (l — e~ XiU ) 



respectively. IRosenberg et all (120031) use this FM method to estimate rates in what is effectively a multi- 
state branching process, but we will compare the two methods on our restricted immigration BDI model 
with immigration rate constrained to be 0. We again simulate an underlying BDI process using A = 0.07 
and [i = 0.12. To compare the two methods, we generate three different sets of data. In each set, we gen- 
erate observed states of the BDI process at a fixed constant distance dt apart. This distance varies across 
the data sets, taking the values .2, .4, and .6, respectively. We repeat this procedure 200 times and compute 
birth and death rate estimates and corresponding 95% confidence intervals using the EM algorithm and FM 
approximation method. We show box plots of the resulting estimates for A and fi in Figure Q] As expected, 
the FM estimates behave reasonably when interval lengths are small, but the approximation becomes poor 
as we increase the interval length. The FM method always underestimates the parameters since the method 
effectively undercounts the number of unobserved jumps in the BDI process. We also compute Monte Carlo 
estimates of coverage probabilities of the two methods, shown above the box plots in Figure [T] Not sur- 
prisingly, coverage of the 95% confidence intervals computed under the proper BDI model likelihood are 
very close to the promised value of 0.95. In contrast, the FM approximation-based 95% confidence intervals 
contain the true parameter value less than 95% for all three simulation scenarios. 
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Figure 1 : Box plots of birth (left panel) and death (right panel) rate estimates, obtained from 200 simulated data sets 
using the EM algorithm and frequent monitoring (FM) method. The true parameter values, used in data simulations, are 
marked by the horizontal dashed lines. Above the box plots, we show Monte Carlo estimates of coverage probabilities 
that the 95% confidence intervals attain. 



3.3 Mycobacterium tuberculosis 1S6U0 Transposon 



We apply our restricted i mmigration EM algori thm to estimation of birth and death rates of the transposon 
IS6110 in M. tuberculosis dMcEvoy et al.l.l2007l) . A transposon or transposable element is a genetic sequence 
that can duplicate, remove itself, and jump to a new location in the genome. IS6110 is a transposon that plays 
an important role in epidemiological studies of tuberculosis. More specifically, the number and locations 
of IS 61 10 elements in the M. tuberculosis form a genetic signature or genotype of the mycobacterium, 
allowing epidemiologists to draw infere nce about disease transmi ssion when the same genotype is observed 
among patients with active tuberculosis (Ivan Embden et al.l.ll993l) . Such genotypic comparison can translate 
into meaningful epidemiological inference only if the dynamics of IS6110 evolution are well understood. 
Therefore, accurate estimation of rat es of changes of IS6iiO-base d genotypes is critical for using these 
genotypes in epidemiological studies (|Tanaka and Rosenberg! . l200ll) . 

We analyze data from an ongoing population- based study that includes all tuberculosis cases reported to 
the San Francisco Department of Public Health ( Cattamanchi et al. . 200fj) . Our data include patients with 
more than one M. tuberculosis isolate from specimens sampled more than 10 days apart and genotyped 
with IS6110 restriction fragment length polymorphism. We ignore genomic locations of IS6110 and assume 
that the transposon counts are discretely observed realizations of a BDI process, with no immigration; in 
particular, we assume that the patient is not reinfected with a different strain of the bacteria in the period 
between observations. Table[Tjgives summary statistics for the data. 

We plot birth and death rate estimates and their 95% confidence intervals, obtained with our EM algo- 
rithm, in Figure|2](vertical bars labeled "All"). The starting values for the EM do not affect the results. In the 
analysis presented, the EM algorithm was started with parameter guesses of .05 and .05 for A and fi, respec- 
tively, and their MLEs were 0.027 and . 031, r espectively. Our estimate for A is consistent with the estimate 
0.0188 ± 0.0103 from Rosenberg et al.l d2003l) . Although the authors' credible interval for fi overlaps with 
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Figure 2: Point estimates and 95% confidence intervals for birth and death rate of the IS6110 transposable element 
obtained from all individuals (ALL) and obtained by separately analyzing three M. tuberculosis lineages: European- 
American (EU), Indo-Oceanic (IND), and East Asian (EA). 



ours, our estimate for fi is noticeably higher than IRosenberg et al.l (120031) 's estimate of 0.0147 ± 0.00906. 
Note from Tabled] that among the intervals with a decrease, the average count drop was by more than 1; 
there were 3 intervals where IS6110 counts dropped by 2, whereas there were no in terval that experienced an 
increase by more than 1. Thus we would expect our estimate for fj, to increase over IRosenberg et all (|2003l )'s 
approximation, whereas that of A should be similar between the two metho ds. We also point out that we 



analyz e an updated version of the data analyzed by IRosenberg et al.l (|2003l ). Moreover, IRosenberg et al 



(120031) use a slightly more complicated model for IS6110 evolution, which takes into accou nt shifts in trans- 



poson location. Given these differences in the data and the methods, consistency of our and lRosenberg et al 
(120031) 's estimates is comforting. 



3.3.1 Mycobacterium tuberculosis Lineage Comparison 

In addition to estimation of the global birth and death rates, we separately estimate these parameters in each 
of the three lineages of M. tuberculosis observed in San Francisco. Based on genomic sequence similarity, 
M. tuberculosis is divided into six main lineag es: Euro-Amer i can, E ast-Asian, Indo-Oceanic, East-African- 
Indian, West- African I and West- African II ( Gagneux et al. , 200d) . In our lineage-specific analysis, we 
consider 109 individuals infected with Euro-American lineage strains, 54 individuals infected with East- 
Asian lineage strains, and 25 individuals infected with Indo-Oceanic lineage strains. The M. tuberculosis 
lineage-specific estimates and confidence intervals are plotted in Figure 2. Most notably, there appears to 
be a substantial difference between death rates of the Euro- American and East-Asian lineages. Since this is 
a novel result that has implications for monitoring tuberculosis with molecular genotyping, we examine the 
difference in death rates between lineages more closely. 

The number of IS6110 elements is a potential confounder in our analysis, because patients infected with 
Euro- American and East- Asian differ drastically in the number of IS6110 elements at the beginning of the 
observation period. The isolates from the Euro-American lineage have between 2 and 17 IS6110 elements, 
with 41 out of 109 patients having the first recorded IS6110 co unt less than 6, while IS6110 counts vary 
between 6 and 22 for the East-Asian isolates. IWarren et al.l (I2002h suggest that IS6110 genotypes with fewer 
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Figure 3: Low vs high count genotype analysis. Histograms of simulated numbers of intervals and sums of interval 
lengths are plotted for intervals with starting values less than six and greater or equal to six. The vertical dashed lines 
indicate the observed values of the four statistics. 



than six elements have very low rate of change, because in their data, cases with no observed changes in the 
genotype are domin ated by such low-count genotypes. However, according to our linear birth-death model, 
Warren et al.l (120021) 's observation of low-count genotypes evolving slower than high-count genotypes is 
nothing but expected. To demonstrate this, we simulate 1000 datasets using our global birth and death 
rates and observed initial IS6110 counts for each patient. We record the number of intervals with equal 
starting and ending values less than six, no,<6> and equal starting and ending values greater or equal to 
six, no,>6- We also recorded the length sum of both kinds of intervals: io,<6 an d io,>6- In our data, 

4.6 > 2.8 



?! 



obs 



53 and 



171 with n^ 6 //jf bs 



0,<6 



n 



obs 

0,>6 



Q b > 6 , in agreement with lWarren et al. 



(120021) 's analysis. Histograms of simulated values of the four statistics, no,<6> ™o,>6> *o,<6> an d io,>6> shown 
in Figure demonstrate that our birth-death model replicates well the observed dynamics of low-count 
and high-count IS6110 genotypes. We conclude that our data do not provide evidence that evolutionary 
dynamics of low-count genotypes differ from high-count genotype dynamics. Therefore, it is unlikely that 
our estimated discrepancy between death rates of Euro- American and East- Asian M. tuberculosis lineages 
is caused by high percentage of low-count genotypes in the Euro-American lineage isolates. 



4 Discussion 

We propose two implementations of an EM algorithm for maximum likelihood estimation of discretely 
observed BDI processes. When the birth rate is zero or the immigration rate is zero or proportional to the 
birth rate, we show that the E-step of the EM algorithm can be reduced to computing a small number of 
one-dimensional Fourier transform integrals. This makes our method more efficient than iHolmesI (|2005h 's 
strategy, which involve s finding nume rical solutions of non-linear ODEs. Moreover, in Appendix C, we 
demonstrate that for the lHolmesI (|2005h 's EM algorithm and for the death-immigration model, our generating 
function method yields analytic formulae for the expected complete-data log-likelihood. Therefore, for 
these classes of models, our EM algorithm is exact, meaning that neither E-step nor M-step of the algorithm 
requires numerical approximations. 

Our second EM algorithm uses exact sampling of end-point conditioned BDI trajectories . This sampling 
algorithm is a direct extension of Hobolthl (2008)'s algorithm. The key difference is that Hobolthl (2008) 
worked with finite state-space CTMCs with diagonalizable infinitesimal generators. Assuming that the 
spectral decomposition of the generator is available, the author was able to obtain analytic expressions for 
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the CDFs of waiting times until the next jump in the algorithm. Although iKarlin and McGregorl (Il958l) 's 
orthogonal polynomial decomposition of BDI transition probabilities is an analog of the matrix spectral 
decomposition, we were not able to use the orthogonal polynomials to avoid numeric integration in our exact 
samp ling algorithm. Since this nu meric integration is a major bottleneck in the algorithm, any progress in 
using IKarlin and McGregor! (Il958l) 's orthogonal polynomials to speed up the integration step will be worth 
the effort. 

To simplify our presentation, we concentrated on a simple parameterization of BDI processes in terms 
of birth, death, and immigration rates. I n many applications, rates of co ntinuous-time Markov processes 
are modeled as functions of covariates z ( Kalbfleisch and Lawlessl fl985l ). For example, in the BDI model 
we can set logA(0A) = z *^a> log/z(0 M ) = z*0 M , \ogv(Q v ) = z t 6 u , where 6\,0 ^,0 V are parameters of 
interest. Under such log-linear parameterization, the E-steps of our EM algorithms remain intact, making 
our numerical integration methods relevant for potentially high-dimensional statistical problems in volving 
BDI p rocesses. One can invoke standard modificati ons of the EM algorithm , such as gradient EM (|Lange , 
19951) and Expectation/Conditional Maximization (IMeng and Rubinl . Il993l) . to circumvent the analytical 
intractability of the M-step under such complex parameterizations of the BDI model. 

If one works in a Bayesian framework, it is clear that the exact sampling algorithm can also be used to draw 
BDI trajectories from their full conditional distribution. This observation suggests a Bayesian data augmen- 
tation (BDA) MCMC algorithm, where BDI trajectories play the role of missing data. Since independent 
gamma priors on birth and death rates are conjugate to the complete-data likelihood of t he restricted immi- 
gratio n BDI model, the BDA MCMC can be accomplished via a two stage Gibbs sampler (|Tanner and Wong , 
1987b . iGolineliil tod) develops a similar BDA MCMC algorithm, except the author uses reversible jump 
MCMC to sample BDI trajectories. Notoriously slow m ixing of r e versib le jump MCMC suggests that the 
two stage Gibbs sampler should be more efficient than iGohnelli (bOQfjh 's method. Unfortunately, in the 
unrestricted immigration BDI model, the prior conjugacy is lost, diminishing the appeal and simplicity of 
the BDA MCMC. 

Finally, we would like to point out that the generating functions derived in Theorem Q] are useful not only 
for computing expected complete-data log-likelihoods of BDI models. For example, we are not aware of 
analytic formulae for expectations of sufficient statistics that do not involve the ending state of the process at 



time t: E(N+ \ X = i), E(A^ - | X Q = i), and E(R+ \ X 



These expectations, useful for prediction 



purposes, can be obtained analytically from the generating functions in Theorem Q] (e.g. E(iV t 
dHi(u,l,0,l,t) /8u\ u=l ). 



i) 



5 Implementation 



We implemented the EM algorithm and BDA MCMC for restricted and full BDI models in the R package 
DOBAD (Discretely Observed Birth And Death processes), available from CRAN (ICRANl lioiol) 

( |http : / /cran . r-pro ject . org/web/packages/DOBAD/| ). 
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Appendices 



A Proof of Theorem __ 

Here, we prove our main result. 

Proof. We consider a joint measure Vi t j(ni,ri2, x, t) = P(X t = j, = n\, Nf = n 2 , Rt < x\Xq = i). 
For ease of notation, we will let Ajj be the instantaneous rate of transitioning from state i to state j for the 
BDI process and Aj = YljM ^ij • Also, we will let a, = i be the reward rate for R t ; that is, for staying in 



state i for time h, the process R t increases by ih. Following iNeutsI (11995b . we start with 

Vij(ni,n 2 ,x,t) = l{i=j}l{ x > ai t}l{ni=n2=0} e ~ 



+ l {j>^} 1 {ni>i} / y%,i-\[ni ~ l,U2,x - (t - u)dj,u)]e Aj( * ^Xj-ijdu 
Jo 

+ !{n 2 >i} / Vij+i[nx,n2-l,x-(t-u)aj,u]e~ Xj( - t ~ u ' ) \j +1) j du, 



o 

where l/.i is the indicator function. Next, we derive differential equations for the Laplace-Stieltjes transform 

V* j (n 1 ,n 2 ,w,t) = J °° e~ wx dV i j(n 1 ,n 2 ,x,t): 

d 

-^V* j (n 1 ,n 2 ,w,t) = - jwV^(ni,n2,w t t) - \j(X + fi) + u] V ij (ni,n 2 ,w,t) 
+ 1 {n 1 >i}^{j>i} - 1) + v] Vi*-_i(ni - l,n 2 ,w,t) 

+ l{n 2 >l}^(j + l)^J+l( n l' n 2 - l,W,t). 

We now write Hi(u,v,w, s,t) = _V hij(u, v, w, t)s j where hij(u, v, w, t) := ^2 ni! n 2 V* j (ni,n 2 ,w,t)u ni v r 
The functions hij then satisfy 

d 

—hij{u,v,w,t) = - [j(A + \i + w) + v) hij(u,v,w,t) + [X(j - 1) + u] uh i> j-. 1 (u,v,w,t)ly> 1 y 
+ (j + l)fivh ijj+ i(u,v,w,t). 

Using this fact, we arrive at 

d 

—Hi = - s^s i_1 j(A + fj, + w)hij + _V(-z/)/t i)j + -vh i0 + s ^_ s j ~ 1 u^h i j^i 
i>i i>i i>i 

+ 2J s J v(j + l)^hij +1 + + s 2 X] sj ~ 2u U ~ l)A/ijj_i 

i>i i>i 

= - (A + /i + w)s—Hi - vHi + suvHi + va—Hi + s 2 u\—Hi, 

OS OS OS 

which proves that i/j satisfies equation (__ . 

Using the method of characteristics, we solve the above PDE with initial condition Hi(u, v, w, s, 0) = s\ 
When A > 0, the solution is 



-_ot\ p — A(a 2 — ai)rt ' 



i 



rr ( ,x _ ( "l -«2^e _____ ) ( ai-a 2 \* -u(i-u ai )t 

ti i{ u,v,w,s,t)- i i _ ^ e _ A(Q2 _ Ql)rt Ma.^.^.^e-ACa™)^ e 
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where cxj = * — , tor « = 1, 2. In the case of A = (death-immigration model), the 

solution is 

Hl (u,v,w,s,t)= se-^-^ tl) e u^F M^" 1 )'. 

\ H + w J 

□ 



B Calculating the Observed Information 



In this section, we provide details for calculating the observed information matrix. lLouisI (119821) shows that, 
in problems with incomplete observations, the observed information can be calculated as 

i Y (o) = -E g (i(o,x)\Y) - E (i(o,x)i(o,xy\Y) 

where I is the complete-data likelihood for, in our case, either the restricted immigration or the full BDI 
model, I is its gradient, and I is its Hessian matrix. Under the restricted immigration model with v = /3X we 
have 



z"((a,aO,x) 



-R T - |3T + ^,-R T + ^- 



so that 



where A = + 2R T Tj3 - 
^ + ^,andC = ^ 



T/3N+ 



A B 

B C 



f ^ + T 2 /3 2 , B = R 2 T + TR T (3 



N+R 7 



N-R T 



N+N- n „ A n -n2 2RtN£ , A£ )2 _ Hessian { 

a 1 V // / 



The generating function presented in Theorem Q] can be used to compute the conditional means of all the 
needed cross-products and square terms in the gradient and Hessian, similarly to the first moment calcula- 
tions outlined in the main text. 
In the full BDI model, the score function is 



and the Hessian is 



-Rt + J2 



i=0 



i\ + v' 



-R T + ^,-T + Y, 



/ y-voo 



l((X,fi,u),X) 



i=0 



iX + v 



i=0 (iX+u) 2 u Z-Ji=0 (iX+u) 2 



fl 2 

\l^i=0 (iX+u) 2 U l^i=0 (iX+v) 2 / 



Since our generating function method precludes us from computing conditional expectations of some entries 
in the above expressions, we estimate expectations of the score and Hessian via Monte Carlo with the help 
of our direct sampling algorithm, described in the main text. 
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C Special Cases 



In this section, we show that for two important special cases of the BDI model the E-step of the EM algorithm 
does not require any numeric approximations. 



Death- Immigration Model 

We have shown that the generating function, Hi(u,v,w, s,t) = E (u N ^v N t e~ wRt s Xt \Xo = i\, for the 
death-immigration model is 



Hi(u, v, w, s,t) = se 



H + w ) 



Suppose we are interested in computing E (A^ + l{x t=J }|^o = *)• First, we fix v = 1 and w = 0. Next, we 
differentiate the generating function once with respect to u and j times with respect to s, plugging in 1 and 
respectively: 

d <9 J 

E {N+l {Xt=j} \X = i) = Q^Hi{u, 1,0, s, t) 



and 



where 



Hi(u,l,0,s,t) = [l + e-^(s- l)]*e" 



uu(a— l)(e 1 



Ji=l,S=0 



(e"* - l) 



/' 



Therefore, 



where 



9u 



, x uu (e M * — l) 
D(u) = i - + u(u - l)t. 



Hi(u, 1, 0, a, i) = (A + 5s) 4 e^ 1 )^ 1 ) [C"(l)s + D'(l)], 

u=l 



C'(l) 



D'(l) 



(e^ - l) 



/' 



f e 



1) 



+ vt. 
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Now, the derivatives with respect to s can be recovered by expanding J^Hi(u, 1,0, s, t) 



u=i 



into a power 



series: 



d , 

— Hi(u, 1,0, s,t) 



u=i 



.m=0 



I 

m 

i+l 



{A + e^ 1 )^ 1 ) [C'(l)s + D'(l)] = e D « [C'(l)s + £>'(!)] 



oo „ fe 

Ett s * 



jfc=0 



> 771=0 



fe=o 



n=0 



C(i) 



^ it! 

I fe=max{0,n— i— 1} 

' ^- n+fc £ n - fc l {n ,. 

Therefore, 

E{N+l {Xt=j} \X = i)=e D ^ 

fc=max{0,,j— i— 1} 

+7J'(1) 

One can derive expectations of iV t ~ and i? f in a similar fashion. 



At— n+k+1 Tjn— fe— l-i 



fc=max{0,,j— i— 1} 



C'(l) 



j-k-1 



Al—j+k+l nj-fc-ll 
A B L {j~k>l} 



j - k 



Ai-j+kgj-k-^ 



{j-k<i} 



Sequence Alignment BDI Model 



Here we demonstrate that our generating function approa ch results in analytic formulae fo r the E-step in t he 
evolutionary EM algorithm, developed by Holmesl ( 2005 ). This is in contrast to the original Holmes (2005)'s 
implementation , which requires numerically solving a system of nonlinear ordinary differen tial equations. 



Holmes! (120051) 's algorithm is based on a TKF91 model of sequence alignment evolution (IThorne et al. 



199 lh . Instead of diving into the intricacies of this model, we refer the reader to Ian Holmes' web page 
( [http : //biowiki . orq/Tkf IndelModelPathSummaries[ ), where he poses an open problem of 
deriving the E-step of iHolmesI ((2005J)'s algorith m in closed form and explicitly formulates this problem in 
terms of the BDI process. To derive the E-step of lHolmesI (120051) 's algorithm in closed form, using our BDI 
notation, one needs to find analytic expressions of the following expectations: 

1. E (N+l {Xi=j} \X = l),E (Nfl {Xt=j} \X = 1), and E (R t l {Xt=j] \X = l) when v = 0, 

2. E (N+l {Xt=j} \X = 0),E (N t -l {Xt=j} \X = 0), and E (R t l {Xt=j} \X = 0) when v = A, 

We derive the analytic formulae for E {N^l {Xt=j} \X Q = 0) ( v = A) and E (N^l {Xt=j} \X = l) (v = 
0). The other expectations can be derived analogously. 



1. Objective: E (N+l {Xt=j} \X = 0)(u = A): 
First, 



E{N+1 



d & 



{Xt=j} 



|Xo = ) = ^^ H ° (r ' S ' t) 



s=0,r=l 
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where 



H+(r,s,t) 



(ai - a 2 )e- A ( 1 - rai ) i 



and ai 



A + /jTv/(A + /j)'-4A^ 



s - a 2 - (s - ai) e - A ( Q 2-«i)r-t " ^ 2Ar 
We find the formula for this partial derivative by explicit differentiation: 



dsi 



— Hj(r, S) t) 



(-l)J'j! (ai - a 2 ) e" A ( 1 - rai ) i (l - e -K^-on)rty 



dsi 



H+(r,s,i) 



(s - a 2 - (s - ai) e -A(a2-ai)rt) 
(-lpj! (ai - a 2 ) e - A ( 1 - rai )* (l - e -^(«2-°iH) J _ ^( r ) 



where 



d d j 
dr dsi 



ll+(r,s,t) 



(aie- A ( Q2 - ai ) rt - a 2 ) j+1 

A'{l)B(l)-A{l)B'(l) 



=0.r=l 



B 2 (l) 



B{r)' 



= (l - e^)' (l - , 



5(1) = ( e ( A ~M)*_^ 
A 



A' (I) = (l-e^-ri 1 



2. Objective: £ (7V+l {Xt=i} |X = l) = 0): 
As before, 



d &> 



E (N+l {Xt=j} \X = 1) = __H+(r, S ,t) 



s=0,r=l 



where 



, ai( S -a 2 )- a2 ( S -ai)e-^-^ ri 
H7(r, s, i) = r , — = a 2 + 

For j = 0, we just need to plug in s = 0: 



ai — a 2 



H+(r, 0, t) = a 2 + 



a 2 ai — a 2 



a 2 + 



A(r) 



Then 



-H« r ,0, t ) 



a 2 - aie- A ( a 2-"iW ~ z 1 B{r)' 
A\l)B{l)-A{l)B'(l) 



=i A(/i - A) 



+ 



B 2 (l) 
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where 



z?(i) 
A'(l) 
B'{1) 



a _ p(A-/*)* 

A 

_Jf_(l + 2^-^ 
/x - A \ A 2 A 

,2 



+ 



A 



3 (A-jt)t 



A(/x — A) /i — A 



(1 + 2/xi) 



For j > 



0*J 



H+(r, S ,t) = (-iy +1 j! 



A(r) 



C(r) 



a 2 (ai - a 2 ) (l - e^-a^y ^ _ ^ ^ _ e A(a a -«i)r^ J 1 



X{oc2—ct\)rt 



«2 ) 



A(a2— ai)rt 



«2 



B(r) 



D(r) 



Then 
where 



H+(r,s,t) 



s=0,r=l 



A'(1)J3(1) - C"(1)D(1) - C(1)D'(1) 



52(1) 



+ 



1)2(1) 



,(i)4(i- 9 (i- e ^y 



B(i) 



„(A- M )^ i_1 



C7(l)=(l-E)(l-e 



£>(1) 



A'(l) = (l - e (A ^ } * 



j'-i 



/i-AV 1 + 2 A ; 



B'(l) = (i + 1) (e (A "^-^) j 



A 



C"(l) = (l - e (A -^* 
L>'(1) =j(e 



fj, — A 
A 2 + /. 2 
A(/x - A) 
A 



e (A-M)*(l + 2/i t) + 



A(/i-A) 
) + (J - l)e {x -^2fit 
t> 2 



// — A 



e (A-M)t(! + 2 M f) + 



A(/x - A) 
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